Abstract

In this paper, we propose a novel method to find characteristic
landmarks on ancient Roman imperial coins using deep convolutional
neural network models (CNNs). We formulate an optimization problem
to discover class-specific regions while guaranteeing a specific,
controlled loss of accuracy. Analysis of the visualization of the
discovered regions confirms not only that the proposed method can
successfully find a set of characteristic regions per class, but also
that the discovered regions are consistent with human expert
annotations. We also propose a new framework to recognize Roman
coins that exploits the hierarchical structure of ancient Roman
coins, using the state-of-the-art classification power of CNNs
adapted to the new task of coin classification. Experimental results
show that the proposed framework is able to effectively recognize
ancient Roman coins. For this research, we have collected a new
Roman coin dataset in which every coin is annotated and consists of
obverse (head) and reverse (tail) images.


1. Introduction 

Ancient Roman coins not only have bullion value from precious
materials such as gold and silver, but also provide people with
beautiful and historical art in relief. They were first introduced
during the third century BC and continued to be minted well into
Imperial times. The major role of Roman coins was to ease the
exchange of goods and services in Roman commerce. Another important
role, which has interested researchers in numismatics, was to convey
historical events or news of the Roman empire via images on the
coins. In particular, Roman imperial coins were used to spread
political propaganda across the empire by engraving portraits of the
Roman emperors or important achievements of the empire. As Roman
imperial coins are closely connected to the historical events of the
empire, they can serve as important references for understanding the
history of the Roman empire.



[Figure 1 panels: (a) Domitian RIC 740, obverse and reverse;
(b) Domitian RIC 921, obverse and reverse]


Figure 1: Sample obverse and reverse images of two ancient Roman
imperial coins. Both coins depict the same emperor (Domitian) on the
obverse side but have distinct reverse depictions, resulting in
different Roman Imperial Coinage (RIC) labels. The descriptions for
them are (a) Obverse: Laureate head right, Reverse: Minerva standing
right on capital of rostral column with spear and shield to right
owl, and (b) Obverse: Laureate head right, Reverse: Pegasus right.


In this paper, we aim to automatically find the visual
characteristics of ancient Roman imperial coins that make them
distinguishable from one another, as well as to recognize their
identities. To achieve these goals, we collected Roman imperial coin
images with their descriptions. We used the Roman Imperial Coinage
(RIC) [2 ] to annotate the collected coin images. RIC is a
comprehensive numismatic catalog of Roman imperial currency, the
result of several decades of work. The RIC provides a chronological
catalog of the coins from 31 BC to 491 AD with descriptions of both
the obverse (head) and reverse (tail) sides of each coin. Figure 1
shows example obverse and reverse images and their descriptions. For
the purpose of classification, we use the RIC catalog number as the
label to predict.

Automatic methods to identify ancient coins have attracted growing
attention as an increasing number of coins are traded every day over
the Internet [4, 12]. One of the main issues in the active coin
market is preventing illegal trade and theft of coins.
Traditionally, coin identification depends on manually searching
catalogs of coin markets, auctions or the Internet. However, it is
impossible for a manual search to cover all trades because the coin
market is very active; for example, over half a million coins are
traded annually in the North American market alone [12]. Therefore,
automatic identification of ancient coins has become important.

Several works on coin classification have appeared in the computer
vision field. Some methods use edge detection on the engraved image
of the coin [23, 26]. Others represent the coin images with local
features such as SIFT [20] and then perform classification [14].
Methods using spatial pyramid models [3] and orientations of pixels
[4] have been proposed to exploit spatial information. Aligning coin
images using deformable part models has improved the recognition
accuracy over the standard spatial pyramid models [1 ].

In this paper, we propose an automatic recognition method for Roman
imperial coins using convolutional neural network models (CNNs).
Recently, CNN models have shown state-of-the-art performance in
various computer vision problems including recognition, detection
and segmentation [8, 11, 27, 28], driven by the increasing
availability of large training datasets and the improvement of the
computational power of GPUs. First, we propose a hierarchical
framework which employs CNN models for the coin classification tasks
by fine-tuning a CNN model pre-trained on the ImageNet dataset [7].

Second, we propose a novel method to find characteristic landmarks
on the coin images. Our method is motivated by the class saliency
extraction proposed in [24] to find class-sensitive regions. We
formulate an optimization problem so that a minimal set of parts of
the coin image is selected while the chosen set is still recognized
by the CNN models as the same category as the full image. We deem
the chosen parts to be the persistent, discriminative landmarks of
the coin. Such landmarks can be critical for the analysis of coin
features by domain experts, such as numismatists or historians.

The contributions of the paper can be summarized as follows: 1) a
new coin dataset in which all the coins have both obverse (head) and
reverse (tail) images with annotations, 2) a new framework for
recognizing ancient Roman coins based on CNNs, and 3) a new
optimization-based method to automatically find characteristic
regions using CNNs while guaranteeing a specific, controlled loss of
accuracy.

2. Related Work 

There have been several methods to recognize coins in the computer
vision field. Bag-of-words approaches with extracted visual
descriptors for coin recognition were proposed in [2, 3, 4, 15]. A
directional kernel that considers orientations of pixels [4] and an
angle histogram method [2] were proposed to use explicit spatial
information. In [3], rectangular spatial tiling, log-polar spatial
tiling and circular spatial tiling methods were used to recognize
ancient coins. Aligning the coin images with the deformable part
model (DPM) [9] further improves the recognition accuracy over the
standard spatial pyramid model [15]. In this paper, we use CNNs,
which exploit spatial information through convolution and handle
displacement of the coin image through max-pooling.

The Roman imperial coin classification problem can be formulated as
fine-grained classification, as all coins belong to one super class,
coin. Distinguishing one class from other similar-looking classes is
one of the challenges in fine-grained classification. To address it,
researchers have studied part-based models in which objects are
divided into a set of smaller parts and classification is performed
by comparing the parts [5, 10]. However, those methods require
annotated part labels during training, which take effort to obtain.
In this paper, we investigate an automatic method to find
discriminative regions on the coins which does not depend on human
effort.

With the impressive performance of deep convolutional neural
network models, many papers have sought to understand why and how
they perform so well and to give insight into the behavior of the
internal layers. The deconvolutional network [27] visualized the
feature activities in the intermediate layers of CNN models by
mapping features to pixels in the reverse order. A data-driven
approach to visualize the receptive field of a neuron in the network
was proposed in [30]. The method in [30] is based on an exhaustive
search using the sliding-window technique and measures the
difference between the presence and absence of one window on the
image. In [24], the authors propose an optimization method to
reconstruct a representative image of a class from an empty image by
calculating the gradient of the CNN model with respect to the image.
In this paper, we propose a novel method to find discriminative
landmarks of the coin image by formulating an optimization problem.
Unlike [30], which requires exhaustive CNN evaluations for the
sliding windows, our method efficiently finds a set of
discriminative regions by performing the optimization.

3. Proposed Method 

In this section, we first describe how to train our convolutional
neural network model for the task of Roman imperial coin
classification. Then, we propose a novel method to discover
characteristic landmarks which make one coin distinguishable from
the others.

3.1. Training Convolutional Neural Network for 
Coin Classification 

The convolutional neural network (CNN) is the most popular deep
learning method, which was heavily studied in the 1990s [18].
Recently, large amounts of labeled data and the computational power
of GPUs have made the convolutional network the most accurate object
classification method [16].

Let $S_c(x)$ be the score of class $c$ for input $x$, which is fed
to a classification layer (e.g., the 1000-way softmax layer in
[16]). Assuming that the softmax loss function is used, the loss
function $\mathcal{L}_c$ of the CNNs can be defined as:

$$\mathcal{L}_c(x; w) = -\log\left(\frac{\exp(S_c(x; w))}{\sum_{c'} \exp(S_{c'}(x; w))}\right), \qquad (1)$$

where $w$ are the weights of the complex, highly structured, deep
CNNs. Then, stochastic gradient descent is used to minimize the loss
$\mathcal{L}_c$ by computing the gradient with respect to $w$,
$\partial \mathcal{L}_c / \partial w$.
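As a small numerical illustration of the loss in (1), the following sketch evaluates the negative log-softmax for a vector of class scores. The function name `softmax_loss` and the example scores are illustrative assumptions, not part of the original implementation, which relies on Caffe.

```python
import numpy as np

def softmax_loss(scores, c):
    """Negative log of the softmax probability of class c, as in Eq. (1)."""
    scores = np.asarray(scores, dtype=float)
    # Subtracting the maximum leaves the loss unchanged but avoids overflow.
    shifted = scores - scores.max()
    log_prob = shifted[c] - np.log(np.exp(shifted).sum())
    return -log_prob

scores = [2.0, 1.0, 0.1]          # hypothetical class scores S_c(x; w)
loss = softmax_loss(scores, c=0)  # small when class 0 has the largest score
```

The loss approaches zero as the score of the target class dominates the others, and equals log(number of classes) when all scores are equal.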

Although CNN models are successful when a large amount of labeled
data exists, they are likely to perform poorly on small datasets
because there are millions of parameters to be estimated [29]. To
overcome this limitation on small data, fine-tuning a pre-trained
model for new tasks has been proposed and has shown successful
performance [19, 28, 2 ]. In the fine-tuning method, we only need to
change the softmax layer (which is usually the last layer of the CNN
model) appropriately for the new task.

Considering the number of coin images in our dataset (about 4500),
the CNN model is likely to overfit if we train it only on the coin
dataset, even if we use the data augmentation method of [16].
Therefore, we train a deep convolutional neural network (CNN) model
by fine-tuning. To this end, we adopt one of the most popular
architectures, proposed by Krizhevsky et al. [16], which is
pre-trained on ImageNet with millions of natural images.
Specifically, we change the softmax layer of [16] for our
classification purpose, and then fine-tune the convolutional network
in the supervised setting. When training, we resize the original
coin image to 256 x 256 and randomly crop a sub-region of size
224 x 224 as the data augmentation discussed in [16]. When testing,
we crop the center of the coin. We use the open-source package Caffe
[13] to implement our CNN model.
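The cropping scheme described above can be sketched as follows. This is a minimal NumPy illustration under the assumption that the image has already been resized to 256 x 256; the helper names `random_crop` and `center_crop` are hypothetical and do not correspond to Caffe functions.

```python
import numpy as np

def random_crop(image, size, rng):
    """Training-time augmentation: take a random size x size sub-region."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)   # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]

def center_crop(image, size):
    """Test-time crop: take the central size x size sub-region."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

rng = np.random.default_rng(0)
coin = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for a resized coin image
train_patch = random_crop(coin, 224, rng)
test_patch = center_crop(coin, 224)
```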

3.2. Hierarchical Classification 


Each coin in our dataset has both obverse (head) and reverse (tail)
images. A straightforward method to use both images is to feed them
together to a classifier (e.g., SVM or CNN) when training. In this
paper, we instead exploit the hierarchical structure of the Roman
imperial coins. One Roman emperor includes several RIC labels, as
shown in Figure 1, while one RIC label belongs to exactly one
emperor. Therefore, we can build a tree structure to represent the
relationship between the Roman emperors and the RIC labels, as
depicted in Figure 2. In the Emperor layer, we compute the
probability $p(e \mid I_o)$ of emperor $e$ given obverse image
$I_o$, and in the RIC layer, $p(r \mid I_r)$ of RIC label $r$ given
reverse image $I_r$. Then the final probability is defined to be the
product of the probabilities on the path from the root to the leaf,
$p(e \mid I_o) \cdot p(r \mid I_r) \cdot \delta(Pa(r) = e)$, where
$Pa(r)$ is the parent of node $r$ and $\delta(\cdot)$ is the
indicator function.


Figure 2: Hierarchical classification for the RIC label. $I_o$ and
$I_r$ are the obverse and reverse images, respectively. In the
Emperor layer, we compute the probability of emperor $e$ given
obverse image $I_o$ (resp., in the RIC layer, the probability of RIC
label $r$ given reverse image $I_r$). Then the final prediction is
defined as the product of the probabilities on the path from the
root to the leaf.

For this purpose, we train two CNN models, one for the RIC label
taking the reverse image and the other for the Roman emperor taking
the obverse image. For a given pair of obverse and reverse images,
we evaluate the probabilities on the nodes in the tree and choose
the leaf node with the maximum value as the prediction result.
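The leaf selection described in this section can be sketched as follows, assuming the two CNNs have already produced probability vectors. The function name `hierarchical_predict`, the `parent` array encoding of the tree, and the toy numbers are illustrative assumptions.

```python
import numpy as np

def hierarchical_predict(p_emperor, p_ric, parent):
    """Pick the RIC leaf maximizing p(e|I_o) * p(r|I_r) * [Pa(r) = e].

    p_emperor: shape (E,), emperor probabilities from the obverse image.
    p_ric:     shape (R,), RIC probabilities from the reverse image.
    parent:    shape (R,), parent[r] is the emperor index of RIC label r.
    """
    p_emperor = np.asarray(p_emperor, dtype=float)
    p_ric = np.asarray(p_ric, dtype=float)
    parent = np.asarray(parent)
    # The indicator removes every (e, r) pair except e = Pa(r), so each
    # leaf has exactly one nonzero path product.
    leaf_scores = p_emperor[parent] * p_ric
    return int(np.argmax(leaf_scores)), leaf_scores

# Toy tree: RIC labels 0 and 1 belong to emperor 0, label 2 to emperor 1.
p_emperor = [0.7, 0.3]
p_ric = [0.2, 0.3, 0.5]
parent = [0, 0, 1]
best, leaf_scores = hierarchical_predict(p_emperor, p_ric, parent)
```

In this toy case the reverse image alone favors label 2, while the path product favors label 1 because the obverse image strongly supports emperor 0.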

3.3. Finding Characteristic Landmarks on Roman 
Coins 

The coin classification problem can be considered a fine-grained
classification problem, as all the images belong to one super class.
Finding discriminative regions that represent class characteristics
plays an important role in fine-grained classification. This is
especially true in the context of Roman coins, where domain experts
(e.g., numismatists) seek intuitive, visual landmark feedback
associated with an otherwise automated classification task.

In this section, we introduce our method to discover characteristic
landmarks on the Roman coins using the CNN model. We define the
characteristic region set as the smallest set of local patches
sufficient to represent the identity of the full image and
distinguish it from other available classes.

Several approaches have been presented in the past that attempt to
identify intuitive, visual class characteristics of CNNs
[21, 24, 2: ]. However, their main purpose is largely to reconstruct
a representative, prototypical class image and not necessarily to
find the discriminative regions. Unlike the previous methods, the
proposed method starts from a specific input image and removes
visual information deemed irrelevant for the coin's accurate
classification as an instance of the same class.

Let $I$ and $I(i)$ be the vectorized image and the $i$-th pixel
intensity of image $I$, respectively. Let $r_k$, $1 \le k \le K$, be
the set of indices that belong to the $k$-th subregion in image $I$.
The subregion could be a superpixel, a patch from an overlapping
sliding window, or even one pixel. We define $I_k$ to represent the
$k$-th subregion as follows:

$$I_k(i) = \begin{cases} I(i) & \text{if } i \in r_k \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Then we define a mask function $f_I(x)$,
$x = [x_1, x_2, \ldots, x_K]^T$, $x_i \in [0, 1]$, which maps image
$I$ to the masked image $f_I(x)$ as a function of $x$:

$$f_I(x) = \left(\sum_{k=1}^{K} x_k \cdot I_k\right) \odot C, \qquad (3)$$

where $\odot$ is the element-wise product and $C$ is a normalization
vector accounting for how many times a pixel appears across the
subregions:

$$C(i) = \frac{1}{\sum_{k=1}^{K} \delta(i \in r_k)}.$$

$x_k$ controls the transparency of the subregion, so that $x_k = 1$
means that the subregion has the full pixel intensity while
$x_k = 0$ implies that the region is transparent.
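The mask function in (3) can be sketched in NumPy as follows, using overlapping sliding windows as the subregions. The helper names `sliding_regions` and `masked_image`, and the assumption that the windows cover every pixel at least once, are illustrative and not taken from the original implementation.

```python
import numpy as np

def sliding_regions(height, width, size, stride):
    """Index sets r_k of size x size windows, as flattened pixel indices."""
    pixel_ids = np.arange(height * width).reshape(height, width)
    regions = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            regions.append(pixel_ids[top:top + size, left:left + size].ravel())
    return regions

def masked_image(image, regions, x):
    """f_I(x) = (sum_k x_k * I_k) ⊙ C for a single-channel image."""
    flat = image.ravel().astype(float)
    weighted = np.zeros_like(flat)
    counts = np.zeros_like(flat)
    for r_k, x_k in zip(regions, x):
        weighted[r_k] += x_k * flat[r_k]  # x_k * I_k is nonzero only on r_k
        counts[r_k] += 1                  # number of subregions containing each pixel
    C = 1.0 / counts                      # assumes every pixel is covered at least once
    return (weighted * C).reshape(image.shape)

image = np.arange(16, dtype=float).reshape(4, 4)
regions = sliding_regions(4, 4, size=2, stride=1)   # 9 overlapping windows
full = masked_image(image, regions, np.ones(len(regions)))
empty = masked_image(image, regions, np.zeros(len(regions)))
```

With all mask entries equal to one the normalization by $C$ recovers the original image, and with all entries equal to zero the image is fully transparent.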

We would like to find an image that consists of the smallest set of
regions but can still be correctly classified by the original CNN
model, with some small, controlled loss in confidence. With the
definition of $f_I(x)$, we formulate our goal as follows:

$$\min_{x} \; \mathcal{L}_c(f_I(x)) + \lambda \mathcal{R}(x) \qquad (4)$$

$$\text{s.t.} \quad p(c \mid f_I(x)) \ge p(c \mid f_I(\mathbf{1})) - \epsilon, \qquad (5)$$

$$\epsilon \ge 0,$$

where $\mathcal{L}_c(\cdot)$ is the loss function of the CNN model
for class $c$, $\mathbf{1}$ is a vector of all ones,
$\mathcal{R}(\cdot)$ is a regularization function and $\lambda$ is a
hyperparameter that controls the regularization. We place the
constraint so that the prediction probability of the masked image
$f_I(x)$ may fall below that of the original image $f_I(\mathbf{1})$
by at most $\epsilon$.

Because we are interested in the absence or presence of a region,
the $L_0$-norm would be an ideal choice for the regularization
function. However, it is non-differentiable, making it difficult to
optimize the objective function. Therefore, we resort to the
$L_1$-norm, which is the closest convex, $C^0$-continuous
approximation of the $L_0$-norm.

Both $\lambda$ and $\epsilon$ play similar roles in controlling the
prediction accuracy of the masked image. If we increase $\lambda$,
$p(c \mid f_I(x))$ decreases because the optimization puts more
emphasis on minimizing $\|x\|_1$ than on the loss function.
Similarly, a large $\epsilon$ allows a low prediction accuracy for
the masked image. Therefore, in this paper we fix $\lambda$ to 1 and
control $\epsilon$, because $\epsilon$ explicitly sets the lower
bound of the prediction accuracy.

We use the negative log of the softmax function as the loss:

$$\mathcal{L}_c(f_I(x)) = -\log\left(\frac{\exp(S_c(f_I(x)))}{\sum_{c'} \exp(S_{c'}(f_I(x)))}\right), \qquad (6)$$

where $S_c$ is the score for class $c$ as in (1).

The optimization in (4) is in general a non-convex problem, a
consequence of the non-convex CNN mapping. We approach the
minimization task in (4) using a general subgradient descent
optimization with backprojection. The gradient can be computed using
the chain rule as:

$$\frac{\partial \mathcal{L}_c}{\partial x} = \left(\frac{\partial f_I(x)}{\partial x}\right)^T \frac{\partial \mathcal{L}_c}{\partial f_I(x)}.$$


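The second component of the gradient,
$\partial \mathcal{L}_c / \partial f_I(x)$, represents the
sensitivity of the CNN output with respect to the input image
(region) and can be computed by back-propagation as discussed in
[24]. Note that this quantity differs from the typical sensitivity
of the loss with respect to the CNN parameters, used in CNN
training. Because $f_I(x)$ is a linear function of the mask $x$, the
gradient $\partial f_I(x) / \partial x$ is easily computed as

$$\frac{\partial f_I(x)}{\partial x} =
\begin{bmatrix}
\frac{\partial f_I^1(x)}{\partial x_1} & \frac{\partial f_I^1(x)}{\partial x_2} & \cdots & \frac{\partial f_I^1(x)}{\partial x_K} \\
\frac{\partial f_I^2(x)}{\partial x_1} & \frac{\partial f_I^2(x)}{\partial x_2} & \cdots & \frac{\partial f_I^2(x)}{\partial x_K} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_I^N(x)}{\partial x_1} & \frac{\partial f_I^N(x)}{\partial x_2} & \cdots & \frac{\partial f_I^N(x)}{\partial x_K}
\end{bmatrix},$$

where $f_I^n(x)$ is the $n$-th element of the masked image and $N$
is its number of pixels.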

The standard gradient descent method for minimizing (4) may violate
the constraint (5) because of the regularization term that enforces
sparseness. Therefore, we use the backprojection method for the
optimization. We first initialize $x$ to $\mathbf{1}$ (i.e., the
full image), then perform gradient descent. If a violation occurs,
we remedy it by taking the gradient with respect to the loss
function only, without considering the regularization, until the
constraint is satisfied.

During the optimization, the loss function and the $L_1$
regularization term in (4) compete with each other under the
constraint of (5). Minimization of the loss function alone typically
requires a large number of regions. On the other hand, the
regularization term attempts to select as few landmarks as possible.
Because non-discriminative regions usually do not contribute to the
minimization of the loss function, they are more likely to be
removed than the persistent, discriminative regions.
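The procedure of this section can be sketched on a toy problem as follows. This is only an illustration under strong assumptions: a hypothetical linear softmax classifier `W` stands in for the CNN, the step size and iteration counts are arbitrary, and the function returns the last mask that satisfied constraint (5). It is not the authors' Caffe implementation.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def find_landmarks(image, regions, W, c, eps, lam=1.0, step=0.05,
                   iters=100, max_repair=50):
    """Toy version of (4)-(5); a linear softmax model W replaces the CNN."""
    flat = image.ravel().astype(float)
    counts = np.zeros_like(flat)
    for r_k in regions:
        counts[r_k] += 1
    C = 1.0 / counts                      # assumes every pixel is covered

    def prob_and_grad(x):
        z = np.zeros_like(flat)
        for r_k, x_k in zip(regions, x):
            z[r_k] += x_k * flat[r_k]
        z *= C                            # masked image f_I(x), Eq. (3)
        p = softmax(W @ z)
        onehot = np.zeros_like(p)
        onehot[c] = 1.0
        dL_dz = W.T @ (p - onehot)        # dL_c / d f_I(x) for the linear model
        # Chain rule: d f_I(i) / d x_k = I(i) * C(i) for i in r_k, else 0.
        grad = np.array([np.dot(flat[r_k] * C[r_k], dL_dz[r_k])
                         for r_k in regions])
        return p[c], grad

    x = np.ones(len(regions))             # start from the full image
    p_full, _ = prob_and_grad(x)
    best_x, best_p = x.copy(), p_full     # last mask known to satisfy (5)
    for it in range(iters):
        _, grad = prob_and_grad(x)
        # Subgradient step on loss + lam * ||x||_1, projected onto [0, 1].
        x = np.clip(x - step * (grad + lam * np.sign(x)), 0.0, 1.0)
        for attempt in range(max_repair):
            p_c, grad = prob_and_grad(x)
            if p_c >= p_full - eps:
                best_x, best_p = x.copy(), p_c
                break
            # Constraint violated: follow the loss gradient only.
            x = np.clip(x - step * grad, 0.0, 1.0)
    return best_x, best_p

rng = np.random.default_rng(0)
image = rng.random((6, 6))
ids = np.arange(36).reshape(6, 6)
regions = [ids[t:t + 3, l:l + 3].ravel() for t in range(4) for l in range(4)]
W = rng.normal(size=(3, 36))              # hypothetical 3-class linear model
c = int(np.argmax(W @ image.ravel()))
x_star, p_star = find_landmarks(image, regions, W, c, eps=0.3)
```

Tracking the last feasible mask is a design choice of this sketch: it guarantees that the returned mask respects (5) even if the repair loop runs out of attempts in a later iteration.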
Table 1: Classification Accuracies for SVM and CNN


[Figure 3 plot: accuracy (y-axis, ticks at 0.5, 0.55, 0.6) versus
iteration (x-axis, x10^4, from 0.5 to 5.5), with curves for CNN and
SVM]


Figure 3: The change of the classification accuracy for 
Reverse as a function of the iteration number (epoch). After 
40,000 iterations, the accuracy remains steady. Note that a 
small number of iterations is sufficient to outperform SVM. 


4. Experiments and Results 

In this section, we explain our experimental settings, including the
coin data collection. We then discuss coin classification using the
CNN model. Finally, we analyze the results of discovering
characteristic landmarks on the coin images.

4.1. Experimental Settings 

Data collection: We have collected ancient Roman imperial coin
images from numismatic web sites. As we are dealing with the problem
of recognizing coins from given images, we did not consider coins
that are severely damaged or hard to recognize. In the next step, we
removed the background of each coin image with a standard background
removal method and resized it to 256 x 256. Each coin in the dataset
has both obverse (front) and reverse images. For the purpose of
classification, we label the coin images according to their RIC
[6, 22]. We found that coins with similar descriptions look similar
to each other, making them almost impossible to differentiate.
Therefore, if the number of differing words in the descriptions of
two coins was less than a threshold (we set it to 2 in this paper),
we considered them to be the same class and assigned the same label.
Finally, we created a new coin dataset consisting of 4526 coins with
314 RIC labels and 96 Roman emperors.
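The label-merging rule can be sketched as follows. The text does not specify how differing words are counted, so this sketch assumes the count is the size of the symmetric difference of the two word sets and that merging is transitive (via union-find); the function names and the example descriptions are hypothetical.

```python
def word_difference(desc_a, desc_b):
    """Number of words that appear in exactly one of the two descriptions."""
    a = set(desc_a.lower().split())
    b = set(desc_b.lower().split())
    return len(a ^ b)

def merge_labels(descriptions, threshold=2):
    """Assign the same label to descriptions differing by fewer than threshold words."""
    parent = list(range(len(descriptions)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(descriptions)):
        for j in range(i + 1, len(descriptions)):
            if word_difference(descriptions[i], descriptions[j]) < threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(descriptions))]

descriptions = [
    "Minerva standing right with spear and shield",
    "Minerva standing right with spear shield",
    "Pegasus right",
]
labels = merge_labels(descriptions)  # first two share a label, third is separate
```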

Baseline method: As a baseline method, we use the SVM model as
described in [15]. In [15], the authors extracted SIFT descriptors
[20] densely and used the k-means clustering method to build the
visual codebook. Then, the image is represented as a histogram of
the visual words from the codebook. We also use the spatial pyramid
model [17] to exploit spatial information. In this paper, we use the
polar coordinate system as the spatial pyramid, as it has shown the
best performance in previous ancient coin recognition approaches
[1, 15]. The polar coordinate system models $r$ radial scales and
$\theta$ angular orientations. We empirically use $r = 2$ and
$\theta = 6$.
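As an illustration of the polar tiling, the following sketch maps a pixel position to one of the 2 x 6 = 12 spatial bins. It assumes equal-width radial rings and equal angular sectors centered on the coin, which the text does not specify; the function name `polar_bin` and its parameters are hypothetical.

```python
import numpy as np

def polar_bin(x, y, cx, cy, radius, n_r=2, n_theta=6):
    """Index of the polar bin containing pixel (x, y) for a coin centered at (cx, cy)."""
    rho = np.hypot(x - cx, y - cy)
    theta = np.arctan2(y - cy, x - cx) % (2 * np.pi)   # angle in [0, 2*pi)
    # Clamp so that points on or beyond the outer edge fall in the last bin.
    r_idx = min(int(rho / radius * n_r), n_r - 1)
    t_idx = min(int(theta / (2 * np.pi) * n_theta), n_theta - 1)
    return r_idx * n_theta + t_idx

inner = polar_bin(138, 128, cx=128, cy=128, radius=128)   # near the center, angle 0
outer = polar_bin(128, 228, cx=128, cy=128, radius=128)   # outer ring, angle pi/2
```

A histogram of visual words would then be accumulated separately for each bin and the per-bin histograms concatenated.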



Task         SVM                CNN
Reverse      42.76% (±1.30)     62.28% (±1.40)
Obverse      62.83% (±2.42)     69.99% (±2.53)
Hierarchy    60.68% (±1.23)     76.18% (±2.01)


Evaluation measure for classification: To measure classification
performance, we use 5-fold cross-validation with class-balanced
partitions: we repeat the experiments 5 times with 4 subgroups as
training data and 1 subgroup as test data, so that each of the 5
subgroups serves as the test data once. The classification accuracy
is measured by the mean of the diagonal of the confusion matrix, and
we report the average of the 5 accuracies over the 5 data splits.
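The accuracy measure can be sketched as follows, assuming the confusion matrix is row-normalized so that its diagonal holds per-class accuracies; classes absent from the test fold are skipped. The function name `mean_class_accuracy` and the toy labels are illustrative.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, n_classes):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    conf = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1
    row_sums = conf.sum(axis=1)
    present = row_sums > 0                 # ignore classes with no test samples
    per_class = np.diag(conf)[present] / row_sums[present]
    return per_class.mean()

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
acc = mean_class_accuracy(y_true, y_pred, n_classes=2)  # (2/3 + 1) / 2
```

Unlike the plain fraction of correct predictions (3/4 here), this measure weights every class equally regardless of how many test samples it has.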

CNN settings: We use the open-source package Caffe [13] to implement
our CNN model. We follow the same network architecture as in [16]
except for the final output layer, where we replace the original
1000-way classification layer for our classification purpose
(314-way for the RIC label prediction and 96-way for the Roman
emperor prediction). We also decrease the overall learning rate
while increasing the learning rate for the new layer, so that the
rest of the model changes slowly while the final layer is updated at
a faster pace [13].

4.2. Coin Classification Results 

We first discuss the fine-tuned CNN models that we use in this
paper. Figure 3 depicts how the classification accuracy changes as a
function of the iteration number (epoch). As shown in the figure,
the classification accuracy remains steady after 40,000 iterations.
Therefore, we fine-tuned our CNN models over 50,000 iterations and
use them thereafter.

The classification accuracy of the CNN model on the collected coin
dataset is given in Table 1. Reverse denotes the task of predicting
the RIC label given the reverse image. In Hierarchy, we predict the
RIC label given both the obverse and reverse images using the
hierarchical classification method discussed in Section 3.2. We also
show the classification accuracy for Obverse, which denotes the task
of predicting the Roman emperor given the obverse image. Because the
number of emperors is smaller than the number of RIC labels, Obverse
is easier than the other tasks. Hierarchy benefits from performing
the easy task (the emperor prediction) first, followed by the more
difficult task of RIC prediction.

CNN significantly outperforms SVM in all three tasks, with gains of
up to about 20 percentage points in accuracy. Specifically, CNN
shows the most significant improvement on Reverse side RIC
classification, for two reasons. First, there are significantly
fewer emperors (96) than RIC labels (314).
Figure 4: Confusion matrices of CNN and SVM for Reverse and
Hierarchy. In both models, Hierarchy performs better than Reverse.
CNN Hierarchy improves the classification accuracies across all the
RIC labels as it takes advantage of the hierarchical structure of
the RIC labels. For visualization, a smoothed heat map is used.

Second, the structure of the reverse side is typically more complex
than that of the obverse side, which consists of well-structured
face profiles. The convolutional features of the CNN exploit spatial
information more effectively than the spatial histogram used in the
SVM. Coins with the same RIC label have a few consistent
characteristic landmark regions, and the CNN model is able to locate
them effectively. On the other hand, the SVM has to depend on the
fixed structure of the spatial pyramid model, which may not be
appropriate for some specific RIC labels. We will discuss the
recovery of the discriminative regions found by the CNN models in
Section 4.3.

The confusion matrices for the classification of the RIC label are
depicted in Figure 4. Hierarchy outperforms Reverse in both the CNN
and SVM models as it exploits the hierarchical structure of the RIC
labels. To better understand this phenomenon, we select two classes
that are confused by Reverse but correctly distinguished by
Hierarchy, as shown in Figure 5. The confusion caused by the
similarity between the reverse images can be removed using the
differently depicted obverse images.

4.3. Discriminative Regions and Landmarks 

In this section, we first examine how the selected regions and
confidence values from the CNN model change as a function of
$\epsilon$. For this purpose, we choose one reverse image and one
obverse image and vary $\epsilon$ from 0.1 to 1.0. Note that
$\epsilon = 1$ implies that the constraint in (5) will never be
violated.

Figure 6 shows the visualization of the discovered landmarks as a
function of $\epsilon$. Because a larger $\epsilon$ allows a smaller
confidence value, the total area of the characteristic parts becomes
smaller, i.e., only the most essential parts remain. Therefore, as
we increase $\epsilon$, relatively less significant regions on the
coin are removed first. For example, Venus in the upper panel holds
a small apple which is considered

[Figure 5 panels: Class a, Class b]



Figure 5: Example of two classes where confusion caused by Reverse
is resolved by Hierarchy. The reverse images for the two classes
look similar to each other. In contrast, the obverse images make
them distinguishable.

as the characteristic part at first. However, as we increase
$\epsilon$, the size of the discriminative areas becomes smaller and
finally the apple turns out less significant than the toss.

When $\epsilon = 1$, no constraint is placed on the optimization.
Therefore, the gradient descent method tries to find a mask that is
as sparse as possible without regard to correct prediction, which
renders the discovered regions meaningless.

On the other hand, the discriminative regions on the obverse images
change slowly. Unlike the reverse, where different characteristic
symbols appear in variable locations, the obverse images have a
common structure, i.e., profiles of the Roman emperors. Therefore,
the obverse images need more parts than the reverse images to remain
distinguishable from the others. As shown in Figure 6, the head and
bust remain present for all $\epsilon$ values.

Figure 7 depicts the visualization of the discovered landmarks on
both reverse and obverse images with two different sliding windows
(11 x 11 and 21 x 21). We set $\epsilon$ to 0.5 and choose coins
that are correctly classified by the CNNs for the experiments. The
results confirm that the proposed method is robust with respect to
the window size. Moreover, the coins with the same RIC label, (a)
and (b), (c) and (d), (g) and (h) in Figure 7, share similar
landmarks. The results imply that there exists a set of
characteristic regions per class, i.e., class-specific
discriminative regions. As we will see next, such regions indeed
point to intuitive visual landmarks associated with RIC
descriptions.

Qualitative analysis: There is no ground truth information available
for the discriminative regions. Therefore, we qualitatively analyze
our proposed method in two different ways. First, we qualitatively
compare our proposed method with recently proposed approaches
[24, 30]. In [30], the authors identify which regions of the image
lead to high unit activations by replicating an image many times
with small occluders at different locations in the image and
measuring the discrepancy of the unit activations between the
original image and the occluded images. An image patch






[Figure 6 panels: masked images for $\epsilon$ = 0.1, 0.3, 0.5, 0.7, 1.
Upper panel: $p_c(f_I(x^*))$ = 0.907, 0.734, 0.616, 0.330, 0.002.
Lower panel: $p_c(f_I(x^*))$ = 0.912, 0.791, 0.712, 0.561, 0.000.]

Figure 6: Visualization of $x^*$ as a function of $\epsilon$, where $x^*$ is the optimized solution. As $\epsilon$ increases, the classification probability
$p_c(\cdot)$ for class $c$ and the landmark areas decrease. All the masked images are correctly identified by the CNNs except for $\epsilon = 1$,
which places no constraint on correct classification and may lead to wrong prediction results. For visualization, we
rescale $x^*$ to the range [0, 1].


which leads to a large discrepancy can be considered important to
the unit. We use two different sizes of image patches (11 x 11 and
21 x 21) and measure the difference of the class score ($S_c$ in
(6)) on a dense grid with a stride of 3.

The saliency extraction method, which computes the gradient of the
CNNs with respect to the image, was proposed in [24]. A single pass
of back-propagation is used to find the saliency map. For a fair
comparison, we apply a moving average with the same patch size as in
the other experiments after back-propagation.

The experimental results in Figure 7 show that our method and [30]
largely agree with each other. This implies that our method is able
to find the important regions which lead to a large discrepancy or,
equivalently, significant changes in classification accuracy. On the
other hand, the saliency extraction method [24] tends to find strong
edge areas without considering class characteristics. For example,
it fails to find the shield in Figures 7a and 7b, which both our
method and [30] are able to discover.

Nevertheless, our proposed method has a distinct advantage over the
occlusion-based approach of [30] in terms of computational time. The
method in [30] requires a very large number of CNN evaluations
(e.g., more than 5000 for an image of 256 x 256). In contrast, the
proposed method is based on an optimization formulation that usually
converges in fewer than 100 iterations while identifying
qualitatively similar landmarks.
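The order of magnitude of this evaluation count can be checked with a short calculation. The sketch below assumes that each occluder position on the stride-3 grid costs one CNN forward pass and that occluders are kept fully inside the image without padding; the function name is hypothetical.

```python
def num_occlusion_windows(image_size, patch, stride):
    """Number of patch positions on a square image, with no padding."""
    per_axis = (image_size - patch) // stride + 1
    return per_axis * per_axis

n11 = num_occlusion_windows(256, 11, 3)  # 82 * 82 = 6724 positions
n21 = num_occlusion_windows(256, 21, 3)  # 79 * 79 = 6241 positions
```

Under these assumptions both patch sizes require more than 5000 forward passes per image, compared with fewer than 100 optimization iterations for the proposed method.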

Next, we use the coin descriptions from RIC to analyze the proposed
method. For this purpose, we remove stop words from the RIC
descriptions and list the remaining words, as depicted in Figure 7.
The landmarks selected by the proposed method strongly correlate
with the descriptions, such as the shield found in Figures 7a, 7b,
7c and 7d. Similarly, the apple in Figure 7f is successfully found
by our method, while the other methods fail to find it or discover
it with little confidence. This attests to the practical utility of
the proposed approach in identifying landmarks consistent with human
expert annotations. In addition, the proposed method may assist
non-experts in generating a visual guidebook for identifying ancient
Roman coins without specific domain expertise.

5. Conclusion 

We proposed a novel method to discover the characteristic landmarks
of ancient Roman imperial coins. Our method automatically finds the
smallest set of discriminative regions sufficient to represent the
identity of the full image and distinguish it from other available
classes. The qualitative analysis of the visualization of the
discovered regions confirms not only that the proposed method is
able to effectively find the class-specific regions, but also that
these regions are consistent with human expert annotations. The
proposed framework for identifying ancient Roman imperial coins
outperforms the previous approach in the domain of coin
classification by using the hierarchical structure of the RIC
labels.
Figure 7: Visualization of discovered landmarks for reverse and obverse images. Red denotes more discriminative, blue less
significant. The proposed method and [30] agree with each other, while the saliency extraction [24] focuses on strong edge
areas. Note that the discovered regions are correlated with descriptions from human expert annotations. For visualization,
we rescale $x^*$ to the range [0, 1].